Abstract

Integrating the outputs of multiple classifiers via combiners or meta-learners has led 
to substantial improvements in several difficult pattern recognition problems. In this 
article we investigate a family of combiners based on order statistics, for robust handling 
of situations where there are large discrepancies in performance of individual classifiers. 
Based on a mathematical modeling of how the decision boundaries are affected by order 
statistic combiners, we derive expressions for the reductions in error expected when 
simple output combination methods based on the median, the maximum and in
general, the ith order statistic, are used. Furthermore, we analyze the trim and spread
combiners, both based on linear combinations of the ordered classifier outputs, and show 
that in the presence of uneven classifier performance, they often provide substantial 
gains over both linear and simple order statistics combiners. Experimental results on 
both real world data and standard public domain data sets corroborate these findings. 

Keywords: Ensembles, order statistics, trimmed means, classification error, robust 
statistics. 

1 Introduction 

Since different types of classifiers have different “inductive bias”, one does not expect 
the generalization performance of two classifiers to be identical [14, 16] for difficult pat- 
tern recognition problems, even when they are both trained on the same data set. If 
only the “best” classifier is selected based on an estimation of the true generalization
performance using a finite test set, valuable information contained in the results of the
discarded classifiers may be lost. Such potential loss of information can be avoided if
the outputs of all available classifiers are used in the final classification decision. This 
concept has received a great deal of attention recently, and many methods for combining 

*To appear in Pattern Analysis and Applications special issue on “Fusion of Multiple Classifiers.” 



classifier outputs have been proposed [15, 17, 19, 24, 29]. Furthermore, promoting di- 
versity among classifiers prior to combining forms the basis of many strategies, including 
bagging, arcing, boosting and correlation control [6, 31]. 

Approaches to pooling classifiers can be separated into two main categories: (i) 
simple combiners, e.g., voting [3], Bayesian based weighted product rule [22], or av- 
eraging [24, 30], and, (ii) meta-learners, such as arbitration [7] or stacking [34]. The 
simple combining methods are best suited for problems where the individual classifiers 
perform the same task, and have comparable success. However, such combiners are 
more susceptible to outliers and to unevenly performing classifiers. In the second cat- 
egory, either sets of combining rules, or full fledged classifiers acting on the outputs of 
the individual classifiers, are constructed [1, 20, 34]. This type of combining is more 
general, but is vulnerable to all the problems associated with the added learning (e.g., 
overfitting, lengthy training time). 

An implicit assumption in most combining schemes is that each classifier sees the 
same training data or resampled versions of the same data. If the individual classi- 
fiers are then appropriately chosen and trained properly, their performances will be 
(relatively) comparable in any region of the problem space. So gains from combining 
are derived from the diversity among classifiers rather than by compensating for weak
members of the pool (i.e., variance reduction) [11]. However, there are situations where 
individual classifiers may not have access to the same data. Such conditions arise in 
certain data mining, sensor fusion and electrical logging (oil services) problems where 
there are large variabilities in the data which is acquired locally and needs to be pro- 
cessed in (near) real time at geographically separated places [9]. These conditions create 
a pool of classifiers that may have significant variations in their overall performance. 
Moreover, they may lead to conditions where individual classifiers have similar average 
performance, but substantially different performance over different parts of the input 
space. 

In such cases, combining is still desirable, but neither simple combiners nor meta- 
learners are particularly well-suited for the type of problems that arise. For example, 
the simplicity of averaging the classifier outputs is appealing, but the prospect of one 
poor classifier corrupting the ensemble makes this a risky choice. Weighted averaging of 
classifier outputs appears to provide some flexibility [18, 23]. Unfortunately, the weights 
are still assigned on a per classifier basis rather than a per sample or per class basis. If 
a classifier is accurate only in certain areas of the input space, this scheme fails to take 
advantage of the variable accuracy of the classifier in question. Using a meta learner 
that provides different weights for different patterns can potentially solve this problem, 
but at a considerable cost. In particular, the off-line training of a meta-learner using
a substantial amount of data output by geographically distributed classifiers may not
be feasible. In addition to providing robustness, the order statistic combiners presented 
in this work also aim at bridging the gap between simplicity and generality by allowing 
the flexible selection of classifiers without the associated cost of training meta-classifiers. 

Section 2 summarizes the relationship between classifier errors and decision bound- 
aries and provides the necessary background for mathematically analyzing order statis- 



tic combiners [30, 32]. Section 3 introduces simple order statistic combiners. Based on 
these concepts, in Section 4 we investigate two powerful combiners, trim and spread,
and derive the amount of error reduction associated with each. In Section 5 we present
the performance of order statistic combiners on a real world sonar problem [15], and
several data sets from the Proben1/UCI benchmarks [4, 25]. Section 6 discusses the
implications of using linear combinations of order statistics as a strategy for pooling the 
outputs of individual classifiers. 

2 Background 

In this section we first summarize the approach in [30, 32]¹, where a framework to
quantify the effect of inaccuracies in estimating the a posteriori class probabilities on the
classification error was introduced. This background is needed to characterize and
understand the impact of order statistics combiners, as described in Sections 3 and 4.
It also introduces the necessary notations and definitions. Then we briefly review the basic
concepts and properties of order statistics.

2.1 Relationship Between a Posteriori Probability Estimates and 
Classification Error for a Single Classifier 

A wide variety of classification models not only provide the suggested class label for 
a given input, but an estimation of the posterior class probabilities for that input as 
well. For example, it is well known that, given one-of-L desired outputs and sufficient 
training, the outputs of a multilayered perceptron network or a radial basis function 
network based classifier trained to minimize a mean square error criterion approximate
the a posteriori probabilities of the corresponding classes [26]. This result is based on 
the universal approximation capabilities of the underlying function approximators. 

For such classification models, one can represent the ith output (corresponding to
class i) of the classifier as:

$$f_i(x) = p_i(x) + \epsilon_i(x), \tag{1}$$

where $p_i(x)$ is the true posterior for the ith class on input x, i.e., $p_i(x) = P(C_i \mid x)$, and $\epsilon_i(x)$
is the error of the classifier in estimating that posterior.

Figure 1 illustrates the true (solid lines) and approximated (dashed lines) posterior
probabilities for classes i and j, given a one dimensional input, x. From Bayes decision
theory, the ideal class boundary is $x^*$, where $p_i(x) = p_j(x)$. The realized boundary is $x_b$,
where $f_i(x) = f_j(x)$. The Bayes error rate is related to the lightly shaded region, while
the extra classification error (called model error or $E_{model}$), incurred because of our
imperfect classifier, is determined by the darkly shaded region, whose size is governed
by b, the offset between $x^*$ and $x_b$. To quantify this relationship, we first break down
the posterior probability approximation error in Eq. 1 into two parts:

$$\epsilon_i(x) = \beta_i + \eta_i(x). \tag{2}$$


¹ These and other related papers can be downloaded from URL http://www.lans.ece.utexas.edu.




Figure 1: Error regions associated with approximating the a posteriori probabilities [30]. 

The first component is the bias or offset in estimating the a posteriori probability of
class i, and does not vary with the input. The second component gives the variability
from the systematic offset error, for different input samples, and has zero mean and
variance $\sigma^2_{\eta_i}$. These two components of the error are similar to the bias and variance
decomposition for a quadratic loss function given in [14], although they are at the
individual input level. We will therefore refer to classifiers as “biased” and “unbiased”,
implying $\beta_k \neq 0$ for some k, and $\beta_k = 0, \forall k$, respectively.

Now, the boundary offset ($b = x_b - x^*$) is a random variable with probability density
function $p_b(x)$, whose mean and variance are given by:

$$\beta = \frac{\beta_i - \beta_j}{s} \tag{3}$$

$$\sigma_b^2 = \frac{\sigma^2_{\eta_i} + \sigma^2_{\eta_j}}{s^2} \tag{4}$$

as derived in [30, 32]. Also, let the first and second moments of the boundary offset b
be represented by

$$M_1 = \int x \, p_b(x) \, dx \quad \text{and} \quad M_2 = \int x^2 \, p_b(x) \, dx.$$


By integrating over the darkly shaded area of Fig. 1 weighted by $p_b(x)$, one can
show that the extra model error can be simply expressed as:

$$E_{model} = \frac{s}{2} M_2 = \frac{s}{2} \left( \sigma_b^2 + \beta^2 \right). \tag{5}$$

Note that if the classifier is not biased, the second term will drop out. To emphasize
the distinction between biased and unbiased classifiers, the model error will be given as
a function of $\beta$ for biased classifiers in Section 3.



The above framework sets the stage for evaluating the effects of combining multiple 
classifiers. In [30] we studied what happens to the model error, $E^{ave}_{model}$, of an averaging
combiner, where for each class i, the approximated posterior probabilities $f_i^m(x)$, $1 \le m \le N$,
of N individual classifiers are averaged, and then the class with the highest value
of this average is chosen as the winner. In this paper, we perform a somewhat more
involved derivation of what happens to the model error, $E^{os}_{model}$, when the combination
is based on order statistics rather than simple averaging.

2.2 Background on Order Statistics 

In this subsection, we briefly discuss some basic concepts and properties of order statistics.
Let X be a random variable with probability density function $p_X(\cdot)$, and cumulative
distribution function $F_X(\cdot)$. Let $(X_1, X_2, \ldots, X_N)$ be a random sample drawn from
this distribution. Now, let us arrange them in non-decreasing order, providing:

$$X_{1:N} \le X_{2:N} \le \cdots \le X_{N:N}.$$

The ith order statistic, denoted by $X_{i:N}$, is the ith value in this progression. The cumulative
distribution function for the smallest and largest order statistic can be obtained
by noting that:

$$F_{X_{N:N}}(x) = P(X_{N:N} \le x) = \prod_{i=1}^{N} P(X_i \le x) = [F_X(x)]^N$$

and:

$$F_{X_{1:N}}(x) = P(X_{1:N} \le x) = 1 - P(X_{1:N} > x) = 1 - \prod_{i=1}^{N} P(X_i > x) = 1 - \prod_{i=1}^{N} \left( 1 - P(X_i \le x) \right) = 1 - [1 - F_X(x)]^N.$$

The corresponding probability density functions can be obtained from these equations. 
In general, for the ith order statistic, the cumulative distribution function gives the
probability that at least i of the chosen X's are less than or equal to x. The probability
density function of $X_{i:N}$ is then given by [10]:

$$p_{X_{i:N}}(x) = \frac{N!}{(i-1)! \, (N-i)!} \, [F_X(x)]^{i-1} \, [1 - F_X(x)]^{N-i} \, p_X(x). \tag{6}$$

This general form however, cannot always be computed in closed form. Therefore, 
obtaining the expected value of a function of x using Equation 6 is not always possible. 
However, the first two moments of the density function are widely available for a variety 
of distributions [2]. These moments can be used to compute the expected values of
certain specific functions, e.g., polynomials of order two or less.
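As a numerical sanity check on Equation 6, the sketch below evaluates the density of the ith order statistic for a Uniform(0, 1) parent, for which $F_X(x) = x$ and $p_X(x) = 1$, and integrates it with a midpoint rule. The function names and the choice of a uniform parent are illustrative assumptions rather than part of the original analysis; for this parent the mean of $X_{i:N}$ is known to be i/(N + 1), which gives an independent value to compare against.

```python
from math import factorial


def order_stat_pdf(x, i, n, cdf, pdf):
    """Density of the i-th order statistic of n i.i.d. samples (Equation 6)."""
    coeff = factorial(n) / (factorial(i - 1) * factorial(n - i))
    return coeff * cdf(x) ** (i - 1) * (1.0 - cdf(x)) ** (n - i) * pdf(x)


def integrate(func, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(func(lo + (j + 0.5) * width) for j in range(steps)) * width


def uniform_cdf(x):
    # Assumed parent distribution: Uniform(0, 1).
    return x


def uniform_pdf(x):
    return 1.0


def moments_uniform(i, n):
    """Total mass and mean of the i-th order statistic for a Uniform(0, 1) parent."""
    mass = integrate(
        lambda x: order_stat_pdf(x, i, n, uniform_cdf, uniform_pdf), 0.0, 1.0)
    mean = integrate(
        lambda x: x * order_stat_pdf(x, i, n, uniform_cdf, uniform_pdf), 0.0, 1.0)
    return mass, mean


if __name__ == "__main__":
    n = 5
    for i in range(1, n + 1):
        mass, mean = moments_uniform(i, n)
        print(f"i={i}: mass={mass:.4f}, mean={mean:.4f}, i/(N+1)={i / (n + 1):.4f}")
```

Other parent distributions can be tried by passing a different `cdf` and `pdf` pair, with the integration limits adjusted to cover the support.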

3 Order Statistics Ensembles 

Now, let us turn our attention to designing an ensemble of N classifiers using order 
statistics (OS) concepts for combining the outputs. For a given input x, let the outputs 
of each of the N classifiers for each class i be ordered in the following manner: 

$$f_i^{1:N}(x) \le f_i^{2:N}(x) \le \cdots \le f_i^{N:N}(x).$$

Then one constructs the kth order statistic combiner, by selecting the kth ranked output
for each class ($f_i^{k:N}(x)$), as representing its posterior.

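In particular, max, med and min combiners are defined as follows:

$$f_i^{max}(x) = f_i^{N:N}(x), \tag{7}$$

$$f_i^{med}(x) = \begin{cases} \dfrac{f_i^{N/2:N}(x) + f_i^{N/2+1:N}(x)}{2} & \text{if } N \text{ is even} \\ f_i^{(N+1)/2:N}(x) & \text{if } N \text{ is odd,} \end{cases} \tag{8}$$

$$f_i^{min}(x) = f_i^{1:N}(x). \tag{9}$$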


These three combiners are relevant because they represent important qualitative in- 
terpretations of the output space. Selecting the maximum combiner is equivalent to 
selecting the class with the highest posterior. Indeed, since the network outputs approx- 
imate the class a posteriori distributions, selecting the maximum reduces to selecting 
the classifier that is the most “certain” of its decision. The drawback of this method 
however is that it can be compromised by a single classifier that repeatedly provides 
high values. The selection of the minimum combiner follows a similar logic, but focuses 
on classes that are unlikely to be correct, rather than on the correct class. Thus, this 
combiner eliminates less likely classes by basing the decision on the lowest value for a 
given class. This combiner suffers from the same ills as the max combiner. However, it 
is less dependent on a single error, since it performs a min-max operation, rather than 
a max-max². The median classifier on the other hand considers the most “typical” representation
of each class. For highly noisy data, this combiner is more desirable than
either the min or max combiners since the decision is not compromised as much by a 
single large error. 
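The three combiners can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the authors' implementation: the function names (`os_combine`, `med_combine`, `decide`) and the example outputs are hypothetical, and `outputs[m][i]` is assumed to hold the output of classifier m for class i on a single input pattern.

```python
from statistics import median


def os_combine(outputs, k):
    """k-th order statistic combiner (k is 1-based: k=1 is min, k=N is max)."""
    n_classes = len(outputs[0])
    return [sorted(clf[i] for clf in outputs)[k - 1] for i in range(n_classes)]


def med_combine(outputs):
    """Median combiner; statistics.median averages the two middle values for even N."""
    n_classes = len(outputs[0])
    return [median(clf[i] for clf in outputs) for i in range(n_classes)]


def decide(combined):
    """Assign the pattern to the class with the highest combined output."""
    return max(range(len(combined)), key=lambda i: combined[i])


if __name__ == "__main__":
    # Hypothetical outputs of N = 3 classifiers for 2 classes;
    # the third classifier strongly disagrees with the other two.
    outputs = [[0.6, 0.4], [0.7, 0.3], [0.1, 0.95]]
    n = len(outputs)
    for name, combined in [
        ("max", os_combine(outputs, n)),
        ("med", med_combine(outputs)),
        ("min", os_combine(outputs, 1)),
    ]:
        print(name, combined, "-> class", decide(combined))
```

In this made-up example the max and min combiners both follow the single dissenting classifier, while the median combiner sides with the majority, which matches the qualitative discussion above.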

The analysis that follows does not depend on the particular order statistic chosen.
Therefore, we will denote all OS combiners by $f_k^{os}(x)$ and derive the model error, $E^{os}_{model}$.
The network output provided by $f_k^{os}(x)$ is given by:

$$f_k^{os}(x) = p_k(x) + \epsilon_k^{os}(x). \tag{10}$$

We shall first analyze the case when there is no bias, and then consider the more in- 
volved situation when at least one classifier provides a biased estimate of the a posteriori 
class probabilities. 


3.1 Combining Unbiased Classifiers through Order Statistics 

For the zero-bias case ($\beta_k = 0, \forall k$), we get $\epsilon_k^{os}(x) = \eta_k^{os}(x)$. Proceeding as in Section 2,
the boundary offset $b^{os}$ is shown to be:

$$b^{os} = \frac{\eta_i^{os}(x_b) - \eta_j^{os}(x_b)}{s}. \tag{11}$$

² Recall that the pattern is ultimately assigned to the class with the highest combined output.

For i.i.d. $\eta_k$'s, the first two moments will be identical for each class. Moreover, taking
the order statistic will shift the mean of both $\eta_i^{os}$ and $\eta_j^{os}$ by the same amount, leaving
the mean of the difference unaffected. Therefore, $b^{os}$ will have zero mean, and variance:

$$\sigma^2_{b^{os}} = \frac{2 \sigma^2_{\eta_k^{os}}}{s^2} = \frac{2 \alpha \sigma^2_{\eta_k}}{s^2} = \alpha \sigma_b^2, \tag{12}$$

where $\alpha$ is a reduction factor that depends on the order statistic and on the distribution
of b. For most distributions, $\alpha$ can be found in tabulated form [2]. For example, Table 1
provides $\alpha$ values for all order statistic combiners, up to 10 classifiers, for a Gaussian
distribution [2, 27]. (Because this distribution is symmetric, the $\alpha$ values of l and k
where l + k = N + 1 are identical, and listed in parentheses.)

Returning to the error calculation, we have $M_1^{os} = 0$ and $M_2^{os} = \sigma^2_{b^{os}}$, providing:

$$E^{os}_{model} = \frac{s M_2^{os}}{2} = \frac{s \sigma^2_{b^{os}}}{2} = \frac{s \alpha \sigma_b^2}{2} = \alpha E_{model}. \tag{13}$$


Table 1: Reduction factors α for the Gaussian Distribution, based on [27].

 N    k        α
 1    1        1.00
 2    1(2)     .682
 3    1(3)     .560
      2        .449
 4    1(4)     .492
      2(3)     .360
 5    1(5)     .448
      2(4)     .312
      3        .287
 6    1(6)     .416
      2(5)     .280
      3(4)     .246
 7    1(7)     .392
      2(6)     .257
      3(5)     .220
      4        .210
 8    1(8)     .373
      2(7)     .239
      3(6)     .201
      4(5)     .187
 9    1(9)     .357
      2(8)     .226
      3(7)     .186
      4(6)     .171
      5        .166
 10   1(10)    .344
      2(9)     .215
      3(8)     .175
      4(7)     .158
      5(6)     .151


Equation 13 shows that the reduction in the error due to using the OS combiner 
instead of the mth classifier is directly related to the reduction in the variance of the 
boundary offset b. Since the means and variances of order statistics for a variety of dis- 
tributions are widely available in tabular form, the reductions can be readily quantified. 
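The reduction factors of Table 1 can be approximated by simulation. The following sketch, which assumes unit-variance Gaussian errors and uses a hypothetical helper name `estimate_alpha`, draws N standard normal values many times, keeps the kth smallest, and reports the sample variance of that order statistic. The estimate carries Monte Carlo noise, so it should only agree with the tabulated value to about two decimal places.

```python
import random


def estimate_alpha(n, k, trials=100_000, seed=0):
    """Monte Carlo variance of the k-th smallest of n standard normal draws."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        draws = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        values.append(draws[k - 1])
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / trials


if __name__ == "__main__":
    # Compare against the Table 1 entries for (N, k) = (2, 1), (3, 2) and (5, 3).
    for n, k, tabulated in [(2, 1, 0.682), (3, 2, 0.449), (5, 3, 0.287)]:
        print(f"N={n}, k={k}: simulated={estimate_alpha(n, k):.3f}, table={tabulated}")
```

By Equation 13, the same number is the factor by which the model error of a single unbiased classifier is multiplied when the corresponding order statistic combiner is used.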


3.2 Combining Biased Classifiers through Order Statistics 

In this section, we analyze the error regions for biased classifiers. Let us return our
attention to $b^{os}$. First, note that the error terms can no longer be studied separately,
since in general $(a + b)^{os} \neq a^{os} + b^{os}$. We will therefore need to specify the mean and
variance of the result of each operation³. Equation 11 becomes:

$$b^{os} = \frac{(\beta_i + \eta_i(x_b))^{os} - (\beta_j + \eta_j(x_b))^{os}}{s}. \tag{14}$$

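Let $\bar{\beta}_k = \frac{1}{N} \sum_{m=1}^{N} \beta_k^m$ be the mean of classifier biases. Since the $\eta_k^m$'s have zero mean,
$(\beta_k + \eta_k(x_b))$ has first moment $\bar{\beta}_k$ and variance $\sigma^2_{\eta_k} + \sigma^2_{\beta_k}$, with $\sigma^2_{\beta_k} = E[(\beta_k^m)^2] - \bar{\beta}_k^2$,
where $E[\cdot]$ denotes the expected value operator.

³ Since the exact distribution parameters of $b^{os}$ are not known, we use the sample mean and the sample
variance.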


Taking a specific order statistic of Equation 14 will modify both moments. The first
moment is given by $\bar{\beta}_k + \mu^{os}$, where $\mu^{os}$ is a shift which depends on the order statistic
chosen, but not on the class. Then, the first moment of $b^{os}$ is given by:

$$\frac{(\bar{\beta}_i + \mu^{os}) - (\bar{\beta}_j + \mu^{os})}{s} = \frac{\bar{\beta}_i - \bar{\beta}_j}{s} = \bar{\beta}. \tag{15}$$

Note that the bias term represents an “average bias” since the contributions due to the 
order statistic are removed. Therefore, reductions in bias cannot be obtained from a 
table similar to Table 1. 

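Now, let us turn our attention to the variance. Since $\beta_k + \eta_k(x_b)$ has variance
$\sigma^2_{\eta_k} + \sigma^2_{\beta_k}$, it follows that $(\beta_k + \eta_k(x_b))^{os}$ has variance $\sigma^2_{\eta_k^{os}} = \alpha (\sigma^2_{\eta_k} + \sigma^2_{\beta_k})$, where $\alpha$ is
the factor discussed in Section 3.1. Therefore, the variance of $b^{os}$ is given by:

$$\sigma^2_{b^{os}} = \frac{\sigma^2_{\eta_i^{os}} + \sigma^2_{\eta_j^{os}}}{s^2} = \frac{2 \alpha \sigma^2_{\eta_i}}{s^2} + \frac{\alpha (\sigma^2_{\beta_i} + \sigma^2_{\beta_j})}{s^2} = \alpha (\sigma_b^2 + \sigma_\beta^2), \tag{16}$$

where $\sigma_\beta^2 = \frac{\sigma^2_{\beta_i} + \sigma^2_{\beta_j}}{s^2}$ is the variance introduced by the systematic errors of different
classifiers.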

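We have now obtained the first and second moments of $b^{os}$, and can compute the
model error. Namely, we have $M_1^{os} = \bar{\beta}$ and $\sigma^2_{b^{os}} = M_2^{os} - (M_1^{os})^2$, leading to:

$$E^{os}_{model}(\beta) = \frac{s}{2} M_2^{os} = \frac{s}{2} \left( \sigma^2_{b^{os}} + \bar{\beta}^2 \right) \tag{17}$$

$$= \frac{s}{2} \left( \alpha (\sigma_b^2 + \sigma_\beta^2) + \bar{\beta}^2 \right). \tag{18}$$

The reduction in the error is more difficult to assess in this case. By writing the error
as:

$$E^{os}_{model}(\beta) = \alpha \frac{s}{2} \left( \sigma_b^2 + \beta^2 \right) + \frac{s}{2} \left( \alpha \sigma_\beta^2 + \bar{\beta}^2 - \alpha \beta^2 \right),$$

we get:

$$E^{os}_{model}(\beta) = \alpha E_{model}(\beta) + \frac{s}{2} \left( \alpha \sigma_\beta^2 + \bar{\beta}^2 - \alpha \beta^2 \right). \tag{19}$$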

Analyzing the error reduction in the general case requires knowledge about the bias 
introduced by each classifier. Unlike regression problems where the bias and variance 
contributions to the error are additive and well-understood, in classification problems 
their interaction is more complex [13]. Indeed it has been observed that ensemble 
methods do more than simply reduce the variance [28]. 

Based on these observations and Equation 19, let us analyze extreme cases. For
example, if each classifier has the same bias, $\sigma_\beta^2$ is reduced to zero and $\bar{\beta} = \beta$. In this
case the error reduction can be expressed as:

$$E^{os}_{model}(\beta) = \frac{s}{2} \left( \alpha \sigma_b^2 + \beta^2 \right) = \alpha E_{model}(\beta) + \frac{(1 - \alpha) s}{2} \beta^2,$$

where $\alpha$ balances the two contributions to the error. A small value for $\alpha$ will reduce the
first component of the error (mainly variance), while leaving the second term untouched.



The net effect will be very similar to results obtained for regression problems. In 
this case, it is important to reduce classifier bias before combining (e.g., by using an 
overparametrized model). 

If on the other hand, the biases produce a zero mean variable, we obtain $\bar{\beta} = 0$. In
this case, the model error becomes:

$$E^{os}_{model}(\beta) = \alpha E_{model}(\beta) + \frac{\alpha s}{2} \left( \sigma_\beta^2 - \beta^2 \right)$$

and the error reduction will be significant if the second term is small or negative. In 
fact, if the variation among the biases is small relative to their magnitude, the error will 
be reduced more than in the unbiased cases. If however, the variation is large compared 
to the magnitude, the error reduction will be minimal. Furthermore, if $\alpha$ is large and
the biases are small and highly varied, it is possible for this combiner to do worse than 
the individual classifiers, which is a danger not present for regression problems. This 
observation very closely parallels results reported in [13]. 
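The interplay between the two terms of Equation 19 is easy to explore numerically. The sketch below implements Equations 5, 18 and 19 directly; the function names and all numeric inputs are hypothetical values chosen only to illustrate the extreme cases discussed above, with `var_b`, `var_beta`, `beta` and `beta_bar` standing for $\sigma_b^2$, $\sigma_\beta^2$, $\beta$ and $\bar{\beta}$.

```python
def model_error(s, var_b, beta):
    """Equation 5: model error of a single classifier."""
    return 0.5 * s * (var_b + beta ** 2)


def os_error_eq18(alpha, s, var_b, var_beta, beta_bar):
    """Equation 18: model error of an order statistic combiner."""
    return 0.5 * s * (alpha * (var_b + var_beta) + beta_bar ** 2)


def os_error_eq19(alpha, s, var_b, var_beta, beta, beta_bar):
    """Equation 19: the same error, written relative to the single-classifier error."""
    extra = 0.5 * s * (alpha * var_beta + beta_bar ** 2 - alpha * beta ** 2)
    return alpha * model_error(s, var_b, beta) + extra


if __name__ == "__main__":
    s, var_b = 1.0, 0.1  # hypothetical values

    # Case 1: identical biases (var_beta = 0, beta_bar = beta); only the variance part shrinks.
    alpha, beta = 0.449, 0.3
    print("single:", model_error(s, var_b, beta))
    print("same bias:", os_error_eq19(alpha, s, var_b, 0.0, beta, beta))

    # Case 2: zero-mean biases that are small but highly varied, with a large alpha;
    # here the combiner can do worse than an individual classifier.
    alpha, beta, var_beta = 0.682, 0.05, 0.2
    print("single:", model_error(s, var_b, beta))
    print("zero-mean biases:", os_error_eq19(alpha, s, var_b, var_beta, beta, 0.0))
```

Equations 18 and 19 are algebraically identical, so the two functions return the same value for any inputs; the second form simply makes the comparison with $E_{model}(\beta)$ explicit.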


4 Linear Combining of Ordered Classifier Outputs 

In the previous section, we derived error reductions when the class posteriors are directly 
estimated through the ordered classifier outputs. Since simple averaging has also been 
shown to provide benefits, in this section, we investigate the combinations of averaging 
and order statistics for pooling classifier outputs. 

4.1 Spread Combiner 

The first linear combination of ordered classifier outputs we study focuses on extrema. 
As discussed in Section 3.1, the maximum and minimum of a set of classifier outputs 
carry specific meanings. Indeed, the maximum can be viewed as the class for which 
there is the most evidence. Similarly, the minimum deletes classes with little evidence. 
To prevent a single classifier from having too large an impact on the eventual
output, these two values can be averaged to yield the spread combiner. This combiner
strikes a balance between the positive and negative evidence, leading to a more robust 
combiner than either of them. 
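A minimal sketch of the spread combiner, under the same assumptions as before (hypothetical function names, and `outputs[m][i]` holding the output of classifier m for class i on one pattern), averages the smallest and largest output for each class:

```python
def spread_combine(outputs):
    """Average of the smallest and largest classifier output for each class."""
    n_classes = len(outputs[0])
    combined = []
    for i in range(n_classes):
        column = [clf[i] for clf in outputs]
        combined.append(0.5 * (min(column) + max(column)))
    return combined


def decide(combined):
    """Assign the pattern to the class with the highest combined output."""
    return max(range(len(combined)), key=lambda i: combined[i])


if __name__ == "__main__":
    # Hypothetical outputs of N = 3 classifiers for 2 classes.
    outputs = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]
    combined = spread_combine(outputs)
    print(combined, "-> class", decide(combined))
```

Only the two extreme outputs per class influence the result, so the classifiers that contribute can differ from class to class and from pattern to pattern.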


4.1.1 Spread Combiner for Unbiased Classifiers: 


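For a classifier without bias, the spread combiner is formally defined as:

$$f_i^{spr}(x) = \frac{1}{2} \left( f_i^{1:N}(x) + f_i^{N:N}(x) \right) = p_i(x) + \eta_i^{spr}(x), \tag{20}$$

where:

$$\eta_i^{spr}(x) = \frac{1}{2} \left( \eta_i^{1:N}(x) + \eta_i^{N:N}(x) \right).$$

The variance of $\eta_i^{spr}(x)$ is given by:

$$\sigma^2_{\eta_i^{spr}} = \frac{1}{4} \sigma^2_{\eta_i^{1:N}(x)} + \frac{1}{4} \sigma^2_{\eta_i^{N:N}(x)} + \frac{1}{2} \mathrm{cov}\left( \eta_i^{1:N}(x), \eta_i^{N:N}(x) \right), \tag{21}$$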



where cov(·, ·) represents the covariance between two variables (even when the $\eta_k$'s are
independent, ordering introduces correlations). Note that because of the ordering, the
variances in the first two terms of Equation 21 can be expressed in terms of the individual
classifier variances. Furthermore, the covariance between two order statistics can also
be determined in tabulated form for given distributions. Table 2 provides these values
for a Gaussian distribution based on [27]. This expression can be further simplified for
symmetric distributions where $\sigma^2_{\eta_i^{1:N}} = \sigma^2_{\eta_i^{N:N}}$ (e.g., Gaussian noise model) and leads to:

$$\sigma^2_{\eta_i^{spr}} = \frac{1}{2} \left( \alpha_{1:N} + B_{1,N:N} \right) \sigma^2_{\eta_i}, \tag{22}$$

where $\alpha_{m:N}$ is the variance of the mth ordered sample and $B_{m,l:N}$ is the covariance
between the mth and lth ordered samples, given that the initial samples had unit
variance [27]. Because this is a symmetric distribution, the B values are also symmetric
(e.g., $B_{1,2:5} = B_{4,5:5}$).

Table 2: Some Reduction Factors B for the Gaussian Distribution, based on [27].

 N    k,l    B
 2    1,2    .318
 3    1,2    .276
      1,3    .165
 4    1,2    .246
      1,3    .158
      1,4    .105
      2,3    .236
 5    1,2    .224
      1,3    .148
      1,4    .106
      1,5    .074
      2,3    .208
      2,4    .150
 6    1,2    .209
      1,3    .139
      1,4    .102
      1,5    .077
      1,6    .056
      2,3    .189
      2,4    .140
      2,5    .106
      3,4    .183
 7    1,2    .196
      1,3    .132
      1,4    .099
      1,5    .077
      1,6    .060
      1,7    .045
      2,3    .175
      2,4    .131
      2,5    .102
      2,6    .080
      3,4    .166
      3,5    .130
 8    1,2    .186
      1,3    .126
      1,4    .095
      1,5    .075
      1,6    .060
      1,7    .048
      1,8    .037
      2,3    .163
      2,4    .123
      2,5    .098
      2,6    .079
      2,7    .063
      3,4    .152
      3,5    .121
      3,6    .098
      4,5    .149
 9    1,2    .178
      1,3    .121
      1,4    .091
      1,5    .073
      1,6    .059
      1,7    .049
      1,8    .040
      1,9    .031
      2,3    .154
      2,4    .117
      2,5    .093
      2,6    .077
      2,7    .063
      2,8    .052
      3,4    .142
      3,5    .114
      3,6    .093
      3,7    .077
      4,5    .137
      4,6    .113




Then, using Equation 4, the variance of the boundary offset $b^{spr}$ can be calculated:

$$\sigma^2_{b^{spr}} = \frac{\sigma^2_{\eta_i^{spr}} + \sigma^2_{\eta_j^{spr}}}{s^2} = \frac{1}{2} \left( \alpha_{1:N} + B_{1,N:N} \right) \sigma_b^2. \tag{23}$$

Finally, through Equation 5, we can obtain the reduction in the model error due to the
spread combiner:

$$\frac{E^{spr}_{model}}{E_{model}} = \frac{\alpha_{1:N} + B_{1,N:N}}{2}. \tag{24}$$

Based on Equation 24 and Tables 1 and 2, Table 3 displays the error reductions provided
by the spread combiner for a Gaussian noise model. For comparison purposes, the error
reduction for the min and max combiners is also provided. Note that for the Gaussian
distribution, the error reduction of min is equal to that of max.

Table 3: Error Reduction Factors for the Spread, min and max Combiners with Gaussian
Noise Model.

 N    spread   min or max
 2    .500     .682
 3    .362     .560
 4    .299     .492
 5    .261     .448
 6    .236     .416
 7    .219     .392
 8    .205     .373
 9    .194     .357
 10   .186     .344
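The spread column of Table 3 can be reproduced from Equation 24 and the tabulated factors. The sketch below transcribes the k = 1 entries of Table 1 and the (1, N) entries of Table 2 for N = 2 to 9 (Table 2 has no entry for N = 10) and compares the result with Table 3. The dictionary and function names are hypothetical; because the inputs are rounded to three decimals, agreement is expected only to within about 0.001.

```python
# alpha_{1:N}: variance of the smallest (or largest) of N unit Gaussians (Table 1, k = 1).
ALPHA_1N = {2: 0.682, 3: 0.560, 4: 0.492, 5: 0.448, 6: 0.416, 7: 0.392, 8: 0.373, 9: 0.357}

# B_{1,N:N}: covariance between the smallest and largest of N unit Gaussians (Table 2).
B_1N = {2: 0.318, 3: 0.165, 4: 0.105, 5: 0.074, 6: 0.056, 7: 0.045, 8: 0.037, 9: 0.031}

# Spread column of Table 3, for comparison.
TABLE3_SPREAD = {2: 0.500, 3: 0.362, 4: 0.299, 5: 0.261, 6: 0.236, 7: 0.219, 8: 0.205, 9: 0.194}


def spread_factor(n):
    """Equation 24: error reduction factor of the spread combiner."""
    return 0.5 * (ALPHA_1N[n] + B_1N[n])


if __name__ == "__main__":
    for n in sorted(ALPHA_1N):
        print(f"N={n}: computed={spread_factor(n):.4f}, table={TABLE3_SPREAD[n]:.3f}")
```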


4.1.2 Spread Combiner for Biased Classifiers: 

Now, if the classifier biases are non-zero, the spread combiner’s output is given by:

$$f_i^{spr}(x) = \frac{1}{2} \left( f_i^{1:N}(x) + f_i^{N:N}(x) \right) = p_i(x) + (\eta_i(x) + \beta_i)^{spr}. \tag{25}$$

In that case, the boundary offset is given by:

$$b^{spr} = \frac{(\beta_i + \eta_i(x_b))^{spr} - (\beta_j + \eta_j(x_b))^{spr}}{s}, \tag{26}$$

which after expanding each term and regrouping can be expressed as:

$$b^{spr} = \frac{(\beta_i + \eta_i(x_b))^{1:N} - (\beta_j + \eta_j(x_b))^{1:N}}{2s} + \frac{(\beta_i + \eta_i(x_b))^{N:N} - (\beta_j + \eta_j(x_b))^{N:N}}{2s}. \tag{27}$$


The first moment of $b^{spr}$ can be obtained by analyzing each term of Equation 27.
In fact, the offset introduced by the first and Nth order statistic for classes i and j
will cancel each other out, leaving only the average bias between the min and max
components of the error (as in Equation 15), given by $\bar{\beta}^{spr} = \frac{\beta_i^{1:N} - \beta_j^{1:N} + \beta_i^{N:N} - \beta_j^{N:N}}{2s}$.

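The variance of $b^{spr}$ needs to be derived from Equation 27. Proceeding as in Equation 16,
the variance of the spread combiner can be expressed as:

$$\sigma^2_{b^{spr}} = \left( \frac{1}{4} \alpha_{1:N} + \frac{1}{4} \alpha_{N:N} + \frac{1}{2} B_{1,N:N} \right) (\sigma_b^2 + \sigma_\beta^2). \tag{28}$$

For a symmetric distribution (where $\alpha_{1:N} = \alpha_{N:N}$), we obtain the following error:

$$E^{spr}_{model}(\beta) = \frac{s}{2} M_2^{spr} = \frac{s}{2} \left( \sigma^2_{b^{spr}} + (\bar{\beta}^{spr})^2 \right)$$

$$= \frac{s}{2} \left( \frac{1}{2} (\alpha_{1:N} + B_{1,N:N}) (\sigma_b^2 + \sigma_\beta^2) + (\bar{\beta}^{spr})^2 \right)$$

$$= \frac{1}{2} (\alpha_{1:N} + B_{1,N:N}) E_{model}(\beta) + \frac{s}{4} (\alpha_{1:N} + B_{1,N:N}) (\sigma_\beta^2 - \beta^2) + \frac{s}{2} (\bar{\beta}^{spr})^2, \tag{29}$$

which is very similar to Equation 19, where the value of $\alpha$ for a single order statistic is
now replaced by $\frac{1}{2} (\alpha_{1:N} + B_{1,N:N})$, since the mean of the first and Nth order statistic is used
in the posterior estimate.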


4.2 Trimmed Means 

Instead of actively using the extreme values as was the case with the spread combiner, 
one can base the posterior estimate around the median values. However, instead of 
selecting one classifier output as was done for f med , one can use multiple classifiers 
whose outputs are “typical.” In this scheme, only a certain fraction of all available 
classifiers are used for a given pattern. The main advantage of this method over weighted 
averaging is that the set of classifiers which contribute to the combiner vary from pattern 
to pattern. Furthermore, they do not need to be determined externally, but are a 
function of the current pattern and the classifier responses to that pattern. 
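The trimming idea can be sketched as follows. This is a minimal illustration with hypothetical names: `n1` and `n2` are the lowest and highest ranks (1-based, inclusive) that are kept for each class, and `outputs[m][i]` is assumed to hold the output of classifier m for class i on one pattern. Because the ranking is done separately for each class and each pattern, the set of contributing classifiers is not fixed in advance.

```python
def trim_combine(outputs, n1, n2):
    """Average the outputs ranked n1..n2 (1-based, inclusive) for each class."""
    n_classes = len(outputs[0])
    combined = []
    for i in range(n_classes):
        ranked = sorted(clf[i] for clf in outputs)
        kept = ranked[n1 - 1:n2]
        combined.append(sum(kept) / len(kept))
    return combined


def ave_combine(outputs):
    """Plain averaging combiner, shown for comparison."""
    n = len(outputs)
    return [sum(clf[i] for clf in outputs) / n for i in range(len(outputs[0]))]


if __name__ == "__main__":
    # Hypothetical outputs of N = 5 classifiers for 2 classes; the last one is faulty.
    outputs = [[0.80, 0.20], [0.70, 0.30], [0.75, 0.25], [0.65, 0.35], [0.00, 1.00]]
    n = len(outputs)
    print("trim:", trim_combine(outputs, 2, n - 1))
    print("ave: ", ave_combine(outputs))
```

In this made-up example the faulty classifier is dropped from both class estimates by the trimming step, whereas it pulls the plain average toward the wrong class.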


4.2.1 Trimmed Mean Combiner for Unbiased Classifiers: 

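Let us formally define the trimmed mean combiner ($\beta_k = 0, \forall k$) as follows:

$$f_i^{trim}(x) = \frac{1}{N_2 - N_1 + 1} \sum_{m=N_1}^{N_2} f_i^{m:N}(x) = p_i(x) + \eta_i^{trim}(x), \tag{30}$$

where:

$$\eta_i^{trim}(x) = \frac{1}{N_2 - N_1 + 1} \sum_{m=N_1}^{N_2} \eta_i^{m:N}(x).$$

The variance of $\eta_i^{trim}(x)$ is given by:

$$\sigma^2_{\eta_i^{trim}} = \frac{1}{(N_2 - N_1 + 1)^2} \sum_{l=N_1}^{N_2} \sum_{m=N_1}^{N_2} \mathrm{cov}\left( \eta_i^{m:N}(x), \eta_i^{l:N}(x) \right)$$

$$= \frac{1}{(N_2 - N_1 + 1)^2} \left( \sum_{m=N_1}^{N_2} \sigma^2_{\eta_i^{m:N}(x)} + 2 \sum_{m=N_1}^{N_2} \sum_{l>m}^{N_2} \mathrm{cov}\left( \eta_i^{m:N}(x), \eta_i^{l:N}(x) \right) \right). \tag{31}$$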



Again, using the factors in Tables 1 and 2, Equation 31 can be further simplified. Note
that because the Gaussian distribution is symmetric, the covariance between the kth
and lth ordered samples is the same as that between the (N + 1 − k)th and (N + 1 − l)th
ordered samples. Therefore, Equation 31 leads to:

$$\sigma^2_{\eta_i^{trim}} = \frac{1}{(N_2 - N_1 + 1)^2} \sum_{m=N_1}^{N_2} \alpha_{m:N} \, \sigma^2_{\eta_i(x)} + \frac{2}{(N_2 - N_1 + 1)^2} \sum_{m=N_1}^{N_2} \sum_{l>m}^{N_2} B_{m,l:N} \, \sigma^2_{\eta_i(x)}, \tag{32}$$


where $\alpha_{m:N}$ is the variance of the mth ordered sample and $B_{m,l:N}$ is the covariance
between the mth and lth ordered samples, given that the initial samples had unit
variance [27]. Using the theory highlighted in Section 2, and Equation 32, we obtain
the following model error reduction:

$$\frac{E^{trim}_{model}}{E_{model}} = \frac{1}{(N_2 - N_1 + 1)^2} \left( \sum_{m=N_1}^{N_2} \alpha_{m:N} + 2 \sum_{m=N_1}^{N_2} \sum_{l>m}^{N_2} B_{m,l:N} \right). \tag{33}$$

Based on Equation 33 and Tables 1 and 2, we have generated a sample trim combiner
reduction table. Because there are many possibilities for $N_1$ and $N_2$, a table that
exhaustively provides all reduction values is not practical. In this sample table we have
selected $N_1 = 2$ and $N_2 = N - 1$, that is, averaging after the lowest and highest values
have been removed. For comparison purposes the reduction factors of the averaging
combiner for N and N − 2 classifiers are also provided (for i.i.d. classifiers the reduction
factors are 1/N as derived in [30, 32]; similar results were obtained for regression problems
[24]). As these numbers demonstrate, although N − 2 classifiers are used in the
trim combiner, selectively weeding out undesirable classifiers provides reduction factors
significantly better than simply averaging N − 2 arbitrary classifiers. The trim combiner
provides reduction factors comparable to the N classifier ave combiner without being
susceptible to corruption by one particularly faulty classifier.


Table 4: Error Reduction Factors for Trim and two corresponding ave Combiners with
Gaussian Noise Model.

 N    ave (for N)   trim (for N_1 = 2; N_2 = N − 1)   ave (for N − 2)
 3    .333          .449                              1.00
 4    .250          .298                              .500
 5    .200          .227                              .333
 6    .167          .184                              .250
 7    .143          .155                              .200
 8    .125          .134                              .167
 9    .111          .113                              .143
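The trim column of Table 4 follows from Equation 33 with $N_1 = 2$ and $N_2 = N - 1$. The sketch below reproduces the entries for N = 3 to 6 from values transcribed from Tables 1 and 2. The dictionary and function names are hypothetical, and covariances that Table 2 does not list explicitly (for example $B_{3,4:5}$) are filled in using the symmetry $B_{k,l:N} = B_{N+1-l,N+1-k:N}$ noted in the text. Since the inputs are rounded, agreement to about 0.001 is all that should be expected.

```python
# alpha_{m:N} for the ranks kept by the trim combiner (Table 1).
ALPHA = {
    3: {2: 0.449},
    4: {2: 0.360, 3: 0.360},
    5: {2: 0.312, 3: 0.287, 4: 0.312},
    6: {2: 0.280, 3: 0.246, 4: 0.246, 5: 0.280},
}

# B_{m,l:N} for pairs of kept ranks (Table 2; mirrored entries filled in by symmetry).
B = {
    3: {},
    4: {(2, 3): 0.236},
    5: {(2, 3): 0.208, (2, 4): 0.150, (3, 4): 0.208},
    6: {(2, 3): 0.189, (2, 4): 0.140, (2, 5): 0.106,
        (3, 4): 0.183, (3, 5): 0.140, (4, 5): 0.189},
}

# Trim column of Table 4, for comparison.
TABLE4_TRIM = {3: 0.449, 4: 0.298, 5: 0.227, 6: 0.184}


def trim_factor(n):
    """Equation 33 with N1 = 2 and N2 = N - 1."""
    n1, n2 = 2, n - 1
    width = n2 - n1 + 1
    var_sum = sum(ALPHA[n][m] for m in range(n1, n2 + 1))
    cov_sum = sum(B[n][(m, l)] for m in range(n1, n2 + 1) for l in range(m + 1, n2 + 1))
    return (var_sum + 2.0 * cov_sum) / width ** 2


if __name__ == "__main__":
    for n in sorted(ALPHA):
        print(f"N={n}: computed={trim_factor(n):.4f}, table={TABLE4_TRIM[n]:.3f}")
```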



4.2.2 Trimmed Mean Combiner for Biased Classifiers:


Now, if the classifier biases are non-zero, the trimmed mean combiner’s output is given
by:

$$f_i^{trim}(x) = \frac{1}{N_2 - N_1 + 1} \sum_{m=N_1}^{N_2} f_i^{m:N}(x) = p_i(x) + (\eta_i(x) + \beta_i)^{trim}. \tag{34}$$

In that case the boundary offset is given by:

$$b^{trim} = \frac{(\beta_i + \eta_i(x_b))^{trim} - (\beta_j + \eta_j(x_b))^{trim}}{s}. \tag{35}$$

The first moment of $b^{trim}$ can be obtained in a manner similar to that of the
spread combiner. Indeed, each mean offset introduced by a specific order statistic for
class i will be offset by the one introduced for class j. Only the trimmed mean of the
biases will remain, giving the first moment of $b^{trim}$, denoted here by $\bar{\beta}^{trim}$.

In deriving the variance of $b^{trim}$, we follow the same steps as in Sections 3.2 and
4.1.1. The resulting boundary variance is similar to Equation 16, but since the
reduction is due to the linear combination of multiple ordered outputs, $\alpha$ is replaced by
A, where:

$$A = \frac{1}{(N_2 - N_1 + 1)^2} \left( \sum_{m=N_1}^{N_2} \alpha_{m:N} + 2 \sum_{m=N_1}^{N_2} \sum_{l>m}^{N_2} B_{m,l:N} \right). \tag{37}$$

The model error reduction in this case is given by:

$$E^{trim}_{model}(\beta) = \frac{s}{2} M_2^{trim} = \frac{s}{2} \left( \sigma^2_{b^{trim}} + (M_1^{trim})^2 \right)$$

$$= \frac{s}{2} \left( A (\sigma_b^2 + \sigma_\beta^2) + (\bar{\beta}^{trim})^2 \right)$$

$$= A \, E_{model}(\beta) + \frac{s}{2} \left( A (\sigma_\beta^2 - \beta^2) + (\bar{\beta}^{trim})^2 \right). \tag{38}$$

Once again we need to look at the interaction between the two parts of the error 
reduction. The first term provides the error reduction compared to the model error of 
an individual classifier. The smaller A is, the more error reduction there will be. In the 
second term, on the other hand, a small value for A is only useful if the variability in 
the individual biases is higher than the biases themselves ($\sigma_\beta^2 > \beta^2$).


5 Experimental Results 

The order statistics-based combining methods proposed in this article are tailored for 
situations where one or more of the following apply: 

1. Individual classifier performance is uneven and class dependent; 



2. It is not possible (insufficient data, high amount of noise) to fine tune the individual 
classifiers without using computationally expensive methods; 

3. Not all the features are available to all the classifiers.

Such situations occur, for example, in electrical logging while drilling for oil, where 
data from certain well sites almost completely misses out on portions of the problem 
space, and in imaging from airborne platforms where the classifiers receive inputs from 
different satellites and/or different types of sensors (e.g., thermal, optical, SAR). In this 
article we restrict ourselves to public domain data sets and simulate such variability in 
two ways, namely, by 

(i) segmenting the feature set and allowing individual classifiers to have access to only
a limited portion of the feature set; and

(ii) using “early stopping”, i.e., prematurely terminating the training of the individual
classifiers 4 .

For the experiments reported below, we used a multi-layer perceptron (MLP) with a
single hidden layer, whose weights were randomly initialized for each run. All
classification results reported in this article are test set error rates averaged over 20
runs, along with the standard error of the mean (standard deviation divided by the square
root of the number of runs). Several types of simple combiners, such as averaging, weighted
averaging, voting, products, weighted products (Bayesian), Dempster-Shafer theory of
evidence, and entropy-based averaging, have been proposed in the literature. However, on
a wide variety of data sets, it has been observed that simple averaging usually provides
results comparable to any of these techniques (and, surprisingly, often better than most
of them) [15, 31]. Furthermore, many ensemble techniques (such as subsampling the
training set as in bagging) can be performed in conjunction with order statistics just as
well as they can be performed with averaging or voting. For this reason, in this study,
we use the average combiner as a representative of simple combiners, for comparison
purposes.
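The reported quantities, a mean test error over the runs together with the standard
deviation divided by the square root of the number of runs, can be computed as in the
following sketch. The function name and the example error rates are hypothetical, and the
use of the sample standard deviation (denominator n - 1) is an assumption, since the text
does not say which estimator was used.

```python
import math
import statistics


def mean_and_standard_error(error_rates):
    """Return (mean, sigma / sqrt(n)) for a list of per-run test error rates."""
    n = len(error_rates)
    mean = statistics.mean(error_rates)
    # Sample standard deviation (n - 1 denominator); an assumption, see the text above.
    sigma = statistics.stdev(error_rates)
    return mean, sigma / math.sqrt(n)


# Hypothetical test error rates (in %) from four runs with different initializations.
runs = [10.0, 12.0, 14.0, 16.0]
mean, sem = mean_and_standard_error(runs)
print(f"{mean:.2f} ± {sem:.2f}")
```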


5.1 Variability through Segmentation 

The first group of experiments focuses on classifiers that, because of circumstances (e.g.,
geography), have access to only a part of the full feature set. Although this situation
is becoming quite common [21, 33], we are not aware of any public domain data sets
for collective data mining. Instead we create variability in three data sets from the
Proben1/UCI benchmarks [4, 25]. Briefly, these data sets, and the corresponding size
of the MLP used, are 5 :

• Card: a 51-dimensional, 2-class data set based on credit approval decision with
690 patterns; an MLP with 10 hidden units;

• Gene: a 120-dimensional data set with two classes, based on the detection of splice
junctions in DNA sequences, with 3175 patterns; an MLP with 10 hidden units;

• Satellite: a 36-dimensional, 6-class data set with 6435 examples of feature vectors
extracted from satellite imagery; an MLP with 20 hidden units.

4 In all the experiments reported here, “early stopping” means that classifiers in an ensemble were trained
half as long as they would have been, had they been stand-alone classifiers.

5 The number of hidden units was determined experimentally.

These three sets were chosen as they have a relatively large number of features and a
somewhat large number of data points, and have been studied by several researchers. Also
note that the Proben1 benchmarks are particular training, validation and test splits of the
UCI data sets, which are available from URL
http://www.ics.uci.edu/~mlearn/MLRepository.html. The results presented in this article are
based on the first training, validation and test partition discussed in [25], where half
the data is used for training, and a quarter each for validation and testing purposes.

We investigate two situations: one where the original features were randomly and
disjointly partitioned among the different segments, and the second where there is some
overlap among features in different segments. The exact segment count and number of
features within each segment is specified in Table 5.

For each data set, we present the original number of features, the number of new
feature sets that result when the feature set is segmented (for Gene we only have
two new sets, because the low dimensionality prevents any further segmentation), and
the resulting number of features in each segment with and without overlap among the
features.

A classifier trains on data from one segment, and different classifiers operate on
different segments. When the number of classifiers in the ensemble was higher than
the number of segments (N = 8), more than one classifier (each starting from a different
initialization) was trained on the same features.
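A minimal sketch of such a segmentation is given below. It randomly and disjointly
partitions the feature indices and, optionally, pads each segment with features drawn from
the other segments to create overlap. The function name `segment_features` and the padding
rule are assumptions: the text does not specify how the overlapping features were chosen.

```python
import random


def segment_features(n_features, n_segments, overlap_size=None, seed=0):
    """Randomly split feature indices into n_segments disjoint groups.

    If overlap_size is given, each group is padded up to that size with
    features drawn at random from the other groups (an assumed overlap rule).
    """
    rng = random.Random(seed)
    indices = list(range(n_features))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin so segment sizes differ by at most one.
    segments = [indices[k::n_segments] for k in range(n_segments)]
    if overlap_size is not None:
        padded = []
        for seg in segments:
            if overlap_size < len(seg):
                raise ValueError("overlap_size is smaller than a disjoint segment")
            others = [f for f in indices if f not in seg]
            padded.append(seg + rng.sample(others, overlap_size - len(seg)))
        segments = padded
    return [sorted(seg) for seg in segments]


# 51 features in 4 segments, as for the Card data set.
disjoint = segment_features(51, 4)
overlapping = segment_features(51, 4, overlap_size=18)
print([len(s) for s in disjoint], [len(s) for s in overlapping])
```

For 51 features and 4 segments the disjoint split yields three segments of 13 features and
one of 12, and the padded version yields 18 features per segment.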


Table 5: Number of features in Proben1/UCI data sets

Data    Number of Original    Number of    Features per Segment
        Features              Segments     no Overlap     Overlap
Card    51                    4            13-13-13-12    18-18-18-18
Gene    120                   4            30-30-30-30    40-40-40-40
Sat     36                    4            9-9-9-9        15-15-15-15


Tables 6-7 present the results (with the best result for each case in bold font). The
misclassification percentages for individual classifiers are reported in the first column.
For the trimmed mean combiner, we also provide $N_1$ and $N_2$, the lower and upper cutting
points in the ordered average used in Equation 30, obtained through the validation set.
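One way to obtain the cutting points from a validation set is an exhaustive search over
all admissible rank pairs, keeping the pair with the lowest validation error. The sketch
below assumes this procedure; the function names and the toy validation data are
hypothetical, and ties are broken in favour of the first pair encountered.

```python
def trim_combine(outputs, n1, n2):
    """Trimmed mean over the order statistics of rank n1..n2 (1-based), per class."""
    n_classes = len(outputs[0])
    combined = []
    for i in range(n_classes):
        ordered = sorted(row[i] for row in outputs)
        kept = ordered[n1 - 1:n2]
        combined.append(sum(kept) / len(kept))
    return combined


def error_rate(val_outputs, val_labels, n1, n2):
    """Fraction of validation patterns misclassified by the (n1, n2) trimmed mean."""
    wrong = 0
    for outputs, label in zip(val_outputs, val_labels):
        combined = trim_combine(outputs, n1, n2)
        predicted = max(range(len(combined)), key=combined.__getitem__)
        wrong += int(predicted != label)
    return wrong / len(val_labels)


def select_cut_points(val_outputs, val_labels):
    """Return the pair (n1, n2) with the lowest validation error."""
    n = len(val_outputs[0])  # number of classifiers
    candidates = [(n1, n2) for n1 in range(1, n + 1) for n2 in range(n1, n + 1)]
    return min(candidates, key=lambda c: error_rate(val_outputs, val_labels, c[0], c[1]))


# One validation pattern of class 0, seen by three classifiers, one of which fails.
val_outputs = [[[0.6, 0.4], [0.6, 0.4], [0.0, 1.0]]]
val_labels = [0]
print(select_cut_points(val_outputs, val_labels))  # prints (2, 2)
```

In this toy case only the median, $(N_1, N_2) = (2, 2)$, classifies the pattern correctly, so
it is the pair that is selected.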

In this case, for two of the three data sets (Gene and Card), there are striking
gains due to using order statistics combiners. One cause for these gains is the high
variability in performance among the component classifiers. In such cases, a small
number of poor classifiers can corrupt the average combiner. By their very nature,
though, combiners based on order statistics are immune to this type of corruption. The
ave combiner performs well on the Sat data set, where the performance among the
individual classifiers is much more homogeneous. In this case, the ave results are only
marginally worse than those for the trimmed mean.
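The corruption of the average by a single poor classifier can be seen in a small
constructed example, shown here for the median as the order statistic. The outputs below
are hypothetical and the helper names are illustrative.

```python
import statistics


def combine(outputs, rule):
    """Apply a combining rule (e.g. statistics.mean or statistics.median) class by class."""
    n_classes = len(outputs[0])
    return [rule([row[i] for row in outputs]) for i in range(n_classes)]


def decide(combined):
    """Index of the class with the largest combined output."""
    return max(range(len(combined)), key=combined.__getitem__)


# Two classifiers favour class 0; a third, failing classifier strongly favours class 1.
outputs = [[0.6, 0.4], [0.6, 0.4], [0.0, 1.0]]
ave_decision = decide(combine(outputs, statistics.mean))
med_decision = decide(combine(outputs, statistics.median))
print(ave_decision, med_decision)  # prints: 1 0
```

The single failing classifier pulls the average towards class 1, while the median still
follows the two classifiers that agree on class 0.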



Table 6: Segmented Features with overlap (% misclassified ± σ/√n).

Data            N   Ave            Max            Min            Spread         Trim ($N_1$-$N_2$)
Card            4   12.21 ± .00    10.58 ± .06    10.61 ± .06    10.58 ± .00    12.21 ± .00 (3-4)
30.30 ± 2.62    8   12.21 ± .00    10.47 ± .00    10.61 ± .06    10.47 ± .00    10.47 ± .00 (7-8)
Gene            4   18.52 ± .10    14.02 ± .13    20.23 ± .31    14.72 ± .15    16.86 ± .15 (3-4)
34.80 ± 4.01    8   18.06 ± .06    13.13 ± .06    17.59 ± .17    13.69 ± .11    13.39 ± .08 (7-8)
Sat             4   14.16 ± .08    14.73 ± .18    14.64 ± .16    14.24 ± .12    14.00 ± .07 (3-4)
16.40 ± 0.56    8   14.21 ± .05    15.27 ± .15    15.07 ± .15    14.49 ± .11    14.01 ± .04 (3-5)


Table 7: Segmented Features without overlap (% misclassified ± σ/√n).

Data            N   Ave            Max            Min            Spread         Trim ($N_1$-$N_2$)
Card            4   12.21 ± .00    10.49 ± .03    10.49 ± .03    10.49 ± .03    12.21 ± .00 (3-4)
30.90 ± 2.66    8   12.21 ± .00    10.78 ± .07    10.78 ± .07    10.52 ± .04    11.05 ± .00 (7-8)
Gene            4   24.35 ± .13    15.82 ± .15    19.15 ± .22    14.09 ± .11    23.11 ± .15 (3-4)
36.87 ± 3.01    8   23.33 ± .19    14.99 ± .14    16.78 ± .24    13.23 ± .15    15.03 ± .17 (7-8)
Sat             4   14.39 ± .09    15.66 ± .15    15.46 ± .11    15.11 ± .11    14.22 ± .07 (2-3)
17.13 ± 0.47    8   14.37 ± .05    15.93 ± .06    15.53 ± .06    15.18 ± .10    14.04 ± .05 (3-5)


5.2 Variability through Early Stopping 

For the second set of experiments we use two classes of acoustic underwater sonar 
signals 6 . From the original sonar signals of four different underwater objects (porpoise 
sound, cracking ice and two different whale sounds), two feature sets are extracted [15]: 

• WOC: a 25-dimensional feature set, consisting of Gabor wavelet coefficients, tem- 
poral descriptors and spectral measurements; and, 

• RDO: a 24-dimensional feature set, consisting of reflection coefficients based on 
both short and long time windows, and temporal descriptors. 

For both feature sets, an MLP with 50 hidden units was used. These data sets are 
available at URL http://www.lans.ece.utexas.edu. Further details about this 4-class 
problem can be found in [15, 31]. 

6 Detailed results on 6 Proben1/UCI data sets were reported in [32] and hence are not repeated here.

Table 8: Combining Results in the Presence of High Variability in Individual Classifier
Performance for the Sonar Data (% misclassified ± σ/√n).

Data            N   Ave            Max            Min            Spread         Trim ($N_1$-$N_2$)
RDO             4   11.57 ± .11    11.94 ± .12    -              11.04 ± .09    11.34 ± .14 (3-4)
13.32 ± 0.83    8   -              11.47 ± .11    -              11.51 ± .09    12.30 ± .08 (4-5)
WOC             4   8.80 ± .09     -              9.31 ± .12     -              8.43 ± .13 (3-4)
12.07 ± 1.12    8   8.82 ± .08     7.68 ± .12     8.91 ± .06     8.24 ± .11     -

Table 8 presents the combining results for the underwater acoustic data set when
the individual classifier performance is highly variable. The results of Table 8 as well
as those given in [32] indicate that when the individual classifier performance is highly
variable, order statistics-based combiners (particularly the spread combiner) typically
provide better classification results than other simple combiners. This performance
improvement is obtained without sacrificing the simplicity of the combiner. On the
other hand, no single method based on order statistics was consistently better than
the simple combiner. Thus there is no sure bet, but one can in practice benefit from
using order statistics based combiners in at least two ways: (i) Since either the max or
min combiner will usually provide better classification rates than ave, but it is difficult
to determine which of the two will be more successful for a specific data set, one can
try both and use a validation set to select one over the other. (ii) Alternatively, one can
just use the spread combiner since it consistently ranks among the best order statistics
results.
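Option (i) above can be sketched as follows: both the max and the min combiner are
evaluated on a validation set and the one with the lower error is kept. The function
names and the toy validation data are hypothetical, and a tie is resolved in favour of
max purely by convention.

```python
def combine(outputs, rule):
    """Apply an order statistic rule (the built-ins max or min) class by class."""
    n_classes = len(outputs[0])
    return [rule(row[i] for row in outputs) for i in range(n_classes)]


def error_rate(val_outputs, val_labels, rule):
    """Fraction of validation patterns misclassified under the given rule."""
    wrong = 0
    for outputs, label in zip(val_outputs, val_labels):
        combined = combine(outputs, rule)
        predicted = max(range(len(combined)), key=combined.__getitem__)
        wrong += int(predicted != label)
    return wrong / len(val_labels)


def pick_max_or_min(val_outputs, val_labels):
    """Return the name of the better combiner and both validation error rates."""
    errors = {name: error_rate(val_outputs, val_labels, rule)
              for name, rule in (("max", max), ("min", min))}
    best = min(errors, key=errors.get)
    return best, errors


# Two validation patterns (classes 0 and 1), each seen by two classifiers.
val_outputs = [
    [[0.9, 0.5], [0.3, 0.4]],
    [[0.2, 0.8], [0.4, 0.6]],
]
val_labels = [0, 1]
print(pick_max_or_min(val_outputs, val_labels))
```

On this toy set the max combiner classifies both patterns correctly while the min combiner
misclassifies the first, so max is selected.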

6 Conclusion 

In this article we present and analyze combiners based on order statistics. These com- 
biners blend the simplicity of averaging with the generality of meta-learners. They are 
particularly effective if there are significant variations among component classifiers in 
at least some parts of the joint input-output space. Variations can arise when the indi- 
vidual training sets cannot be considered as random samples from a common universal 
data set. Examples of such cases include real-time data acquisition and classification 
from geographically distributed sources or data mining problems with large databases, 
where random subsampling is computationally expensive and practical methods lead to 
non-random subsamples [5]. Furthermore, the robustness of order statistics combiners 
is also helpful when certain individual classifiers experience catastrophic failures (e.g., 
due to faulty sensors). 

The analytical framework provided in this paper quantifies the reductions in error
achieved when an order statistics based ensemble is used. It also suggests that the
two methods for linear combination of order statistics introduced in this paper should
provide more reliable estimates of the true posteriors than any of the individual order
statistic combiners. While interpreting the results one must bear in mind the simplifying
assumptions underlying this framework. Perhaps the biggest assumption is that the
distribution of error $\epsilon_i(x)$ in Eq. 1 is i.i.d. across all classifiers. Furthermore, we assume
that for a given classifier and input, $\epsilon_i(x)$ is independent across all classes. The latter
assumption is clearly not true if one normalizes the outputs to always sum to one, in
which case one degree of freedom is lost. One can avoid both these assumptions through
a more involved model, but it significantly complicates derivations without producing
any additional insight.

The experimental results of Section 5 indicate that when there is high variability
among the classifiers, the order statistics-based combiners significantly outperform
simple combiners, whereas in the absence of such variability these combiners perform no
worse. Thus the family of order statistic combiners is able to extract an appropriate
amount of information from the individual classifier outputs without requiring tuning
additional parameters as in meta-learners, and without being substantially affected by
outliers.

A source of variability not investigated in this paper is when different and diverse 
feature sets are used to describe the same set of underlying physical phenomena. Ex- 
perimental results show that combining can exploit such variability to further improve 
accuracy [12, 15, 8], and it will be instructive to see how order statistics based combiners 
fare in such scenarios. 